The replication “crisis” in medicine, and in science more generally, has been the poster child for problems with how science is conducted since 2015. That is when the term took off, following a published attempt to replicate 100 psychological studies. The authors found that only 39% could be replicated, suggesting that the other 61% were probably spurious and the original studies false positives. As we have been pointing out since then, calling it a “crisis” is a bit dramatic, but it is a legitimate issue. Properly framed, it can be a good lens into the challenges of modern science.
A replication is a study that attempts to reproduce the findings of an earlier study. Sometimes this is an exact replication, in which the original methods are followed precisely on a fresh set of data to make sure the original findings were not a statistical fluke. Other replications try to reproduce the same phenomenon with slightly, or even completely, different methods. The idea is that if the phenomenon is real it should replicate for anyone who looks for it. To put it more technically, repeated studies should produce effect size estimates that form a bell curve of probability around the true effect size.
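To make that more concrete, here is a minimal Python sketch (the true effect size, sample size, and noise level are all invented for illustration) simulating many exact replications of the same study. The estimated effect sizes pile up in a bell curve around the true value:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

true_effect = 0.4   # hypothetical true standardized effect size
n = 50              # hypothetical sample size per replication
replications = 10_000

# Each replication estimates the effect from a fresh sample of n noisy observations.
samples = rng.normal(loc=true_effect, scale=1.0, size=(replications, n))
estimates = samples.mean(axis=1)

print(f"mean of estimates: {estimates.mean():.3f}")        # clusters near the true 0.4
print(f"spread (SD) of estimates: {estimates.std():.3f}")  # about 1/sqrt(50), or 0.14
```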
Wrong ideas, or hypotheses that are false, should also display a bell curve of results, centered on no effect, which means there are going to be false positives. So when we talk about the replication crisis (let’s call it a “problem” from here out), the real question is – if science is functioning optimally (no fraud, no mistakes, valid methods), at what rate should initial positive findings replicate in later research? It’s not 100%, because not all new hypotheses in science are correct, and even false hypotheses will sometimes generate positive data by random chance. Is 39%, as that original psychology paper found, a reasonable number? If so, then there is no problem at all and we are wringing our hands over nothing.
A recent article by Alexander Bird argues that this latter position may be close to the truth. He writes:
In this article I argue that the high rate of failed replications is consistent with high-quality science. We would expect this outcome if the field of science in question produces a high proportion of false hypotheses prior to testing.
The apparent replication problem, he concludes, is therefore just a manifestation of the base rate fallacy. This is a failure to consider the base rate at which a phenomenon occurs when judging how likely a specific instance is. In some contexts this is also called the representativeness heuristic (a cognitive bias rooted in the base rate fallacy). For example, anti-vaxxers may point to the fact that more vaccinated people than unvaccinated people get COVID, implying that the vaccine does not work. This, however, fails to consider that there are many more vaccinated people than unvaccinated people – the base rate.
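A quick back-of-the-envelope calculation, with made-up but plausible numbers, shows how this works: when most people are vaccinated, the vaccinated can supply most of the cases even though the vaccine dramatically cuts each individual’s risk.

```python
population = 1_000_000
vax_rate = 0.90            # assumed: 90% of the population is vaccinated
base_attack_rate = 0.10    # assumed infection risk if unvaccinated
vaccine_efficacy = 0.80    # assumed: vaccine cuts infection risk by 80%

vaccinated = population * vax_rate
unvaccinated = population - vaccinated

cases_vax = vaccinated * base_attack_rate * (1 - vaccine_efficacy)
cases_unvax = unvaccinated * base_attack_rate

print(f"cases among vaccinated:   {cases_vax:,.0f}")    # 18,000
print(f"cases among unvaccinated: {cases_unvax:,.0f}")  # 10,000
```

Even with an 80% effective vaccine, the vaccinated outnumber the unvaccinated among cases, because they outnumber them in the population – the base rate.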
In the context of replication, the base rate fallacy results in underestimating the number of new hypotheses that are false. That really is the ultimate question here – of all new ideas in science, how many will turn out to be true? If that proportion is small, then most studies will be testing an ultimately false hypothesis, and the false positives generated by wrong hypotheses may outnumber the true positives generated by the smaller pool of correct hypotheses.
So what’s the answer – how many new hypotheses in science are actually correct? We don’t know precisely, and the answer likely varies widely across disciplines, cultures, and even institutions. There have been estimates, however, clustering around 10%. That is a rough figure, but we can use it for a thought experiment. In a world where 10% of new ideas in science are correct, where the conduct of science is perfect, where we use a P-value of 0.05 as the cutoff for a “positive” study, and where all studies are adequately powered to detect the phenomenon, about a third of positive studies will be false positives that won’t replicate.
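The arithmetic behind that one-third figure is simple enough to show directly. Here is the thought experiment as a short Python sketch, using the assumptions stated above (a 10% base rate of true hypotheses, a 0.05 threshold, and idealized perfect power):

```python
prior_true = 0.10   # assumed fraction of new hypotheses that are correct
alpha = 0.05        # false positive rate when the hypothesis is false
power = 1.00        # idealized: every real effect is detected

hypotheses = 1000
true_hyps = hypotheses * prior_true      # 100 correct hypotheses
false_hyps = hypotheses - true_hyps      # 900 incorrect hypotheses

true_positives = true_hyps * power       # 100 positive studies of real effects
false_positives = false_hyps * alpha     # 45 positive studies of nothing

fdr = false_positives / (true_positives + false_positives)
print(f"share of positive studies that are false: {fdr:.0%}")  # ~31%
```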
But of course we don’t live in a perfect world. The one-third figure is actually lower than the non-replication rates some studies are finding. This is why some researchers argue that, even accounting for the base rate fallacy, the replication problem is not completely explained. The base rate does, however, put the problem into context and support the position that it may be overblown. Still, we should be moving toward a more optimal scientific world. How do we get there?
We have addressed this issue many times at SBM, so here is a quick summary (all proposals made by other scientists and published in the literature). Some authors argue that we should lower the traditional P-value threshold for accepting a finding as “statistically significant” from 0.05 to 0.005. This would reduce the false positive rate but increase the false negative rate. That tradeoff is unavoidable; the question is where the optimal balance lies, and perhaps 0.05 is not it. The deeper question is whether false positives or false negatives are the bigger problem for science. Should we have a high tolerance for false positives (in which case, stop complaining about the non-existent replication problem), or are they a drag and an inefficiency in the system that we should seek to minimize? On the other hand, if a false negative kills a good idea in the crib, is that a greater loss for science?
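To make the tradeoff concrete, here is a hedged sketch, assuming a one-sided z-test with a fixed sample size and an invented standardized effect size. Tightening the threshold sharply cuts the share of positives that are false, but it also cuts power, so more real effects are missed:

```python
from scipy.stats import norm

prior_true = 0.10          # assumed fraction of hypotheses that are correct
effect, n = 0.3, 100       # hypothetical standardized effect and sample size
delta = effect * n**0.5    # signal strength of a one-sided z-test

for alpha in (0.05, 0.005):
    # Power: probability of detecting the effect at this threshold.
    power = 1 - norm.cdf(norm.ppf(1 - alpha) - delta)
    tp = prior_true * power          # true positives per hypothesis tested
    fp = (1 - prior_true) * alpha    # false positives per hypothesis tested
    fdr = fp / (tp + fp)
    print(f"alpha={alpha}: false share of positives={fdr:.0%}, "
          f"real effects missed={1 - power:.0%}")
```

With these illustrative numbers, moving from 0.05 to 0.005 drops the false share of positive studies from roughly a third to a few percent, but the fraction of real effects missed climbs from under 10% to around a third – exactly the balance question posed above.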
Others argue that we should stop relying so heavily on P-values. Journals should be (and some are) requiring other measures of whether a hypothesis is likely to be true, such as effect sizes and Bayesian analysis. If we use multiple measures of probability, then perhaps we can lower both the false positive and false negative rates.
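As a rough illustration of the Bayesian approach (the prior and the Bayes factors below are invented numbers, not values from any particular study), instead of reporting a bare P-value a study can report how much the data shift the odds that the hypothesis is true:

```python
prior_true = 0.10   # assumed prior: 10% of new hypotheses are correct
prior_odds = prior_true / (1 - prior_true)

# Hypothetical Bayes factors: how much more likely the data are under the
# hypothesis than under the null.
for bayes_factor in (1, 3, 10, 30):
    posterior_odds = prior_odds * bayes_factor
    posterior = posterior_odds / (1 + posterior_odds)
    print(f"BF={bayes_factor:>2}: probability hypothesis is true = {posterior:.0%}")
```

Note that with a 10% base rate, even moderately strong evidence (a Bayes factor of 3) leaves the hypothesis more likely false than true – the base rate at work again.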
Yet another proposal is for internal replications to become standard in scientific studies. This means that researchers test a new hypothesis on a preliminary set of data and, if the result is positive, replicate the finding themselves on a fresh set of data. Only if the internal replication succeeds do they publish. This way we can reduce much of the false positive noise in the literature, at a comparatively modest cost in false negatives as long as each study is well powered.
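A rough sketch of why this helps (the power and base rate values are assumptions for illustration): a false hypothesis now has to clear the 0.05 bar twice, which happens only alpha-squared of the time, while a well-powered true effect usually survives both rounds.

```python
prior_true = 0.10   # assumed fraction of hypotheses that are correct
alpha = 0.05
power = 0.90        # assumed power of each individual study

# Single study vs. study plus internal replication (both must be positive).
for label, p_true, p_false in [
    ("single study",       power,    alpha),
    ("with internal rep.", power**2, alpha**2),
]:
    tp = prior_true * p_true
    fp = (1 - prior_true) * p_false
    fdr = fp / (tp + fp)
    print(f"{label}: false share of positives={fdr:.0%}, "
          f"real effects missed={1 - p_true:.0%}")
```

Under these assumed numbers, the false share of positive studies falls from about a third to a few percent, while the fraction of real effects missed roughly doubles (10% to 19%) – a substantial gain, with a cost that stays manageable only when individual studies are well powered.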
This all sounds fantastic, but critics have pointed out pragmatic problems. Raising the bar for publication significantly could be crippling for young researchers trying to earn their PhD or launch their career, and would be especially onerous for marginalized groups and for researchers from poorer countries or institutions.
So how do we balance all of this in our theoretically perfect scientific world? One way is to institutionalize (to an even greater degree than is already the case) scientific best practices. This includes all the things we discuss here: preregistration of trials, eliminating the behaviors that result in P-hacking, and more thorough statistical analysis rather than reliance solely on P-values. Editors of scientific journals also need to stop favoring articles merely for their potential to boost the journal’s impact factor, and publish more replications. Institutions need to support young researchers and adjust the “publish or perish” culture that motivates publishing lots of preliminary data.
Even with these tweaks, there are going to be many published studies that are false positives and don’t replicate. It comes with the territory, as Bird pointed out, and is statistically unavoidable. Internal replications are great, but it is unrealistic to set the minimum bar for publication that high. As a compromise I propose that journals publish preliminary research, as they do now, but label it as such. This serves as a warning to journalists and other researchers: these are preliminary data about a fresh hypothesis, many or most won’t replicate, so view the results with caution. But it also gets the new hypothesis out there for others to research. Remember – whether or not a finding replicates is how we discover whether a phenomenon is real. If 100% of findings replicated, the process would not be serving that purpose.
The real replication problem, therefore, may be mostly one of perception. The problem with findings that don’t replicate is not that the original study was a false positive, or that the new hypothesis was wrong. Again, these are unavoidable and part of the scientific process. The problem is a media culture that presents every new finding as if it is definitely real and as if we should immediately change our behavior in response. This is exacerbated by social media, and by an industry of hype and snake oil looking to exploit such findings. A prominent warning about the preliminary nature of such findings may help mitigate this.